
LR Schedules

Techniques for adjusting the learning rate over the course of training to improve optimization.

Learning Rate Warmup and Decay

The practice of gradually increasing the learning rate $\eta$ from a small value to its peak. This gives adaptive optimizers (e.g. Adam) time to build up reliable gradient statistics before taking large steps. The theory that Adam-like optimizers work better with warmup was validated by the RAdam (2019) paper.
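
A minimal sketch of the linear warmup ramp, assuming a hypothetical peak LR of 3e-4 and 1,000 warmup steps; after the ramp, one of the decay schedules below would normally take over.

```python
# Minimal linear-warmup sketch; peak_lr and warmup_steps are hypothetical values.
def warmup_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 1000) -> float:
    """Ramp the learning rate linearly from ~0 to peak_lr over warmup_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # after warmup, a decay schedule would usually take over

# LR at a few points of training.
for s in (0, 499, 999, 5000):
    print(s, warmup_lr(s))
```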

TODO: Loss Catapult theory

Inverse Square Root

$\eta$ decays proportionally to $1/\sqrt{\text{step}}$. It starts at the maximum LR and has already halved by step 4, so the decay is aggressive early on.
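
A sketch of that decay rule with a hypothetical peak LR; $1/\sqrt{4} = 0.5$ is where the "halved by step 4" observation comes from.

```python
import math

# Inverse-square-root decay: eta(step) = peak_lr / sqrt(step).
# peak_lr is a hypothetical value; steps are 1-indexed to avoid division by zero.
def inv_sqrt_lr(step: int, peak_lr: float = 1e-3) -> float:
    return peak_lr / math.sqrt(max(step, 1))

for s in (1, 4, 16, 100):
    print(s, inv_sqrt_lr(s))  # 1e-3, 5e-4, 2.5e-4, 1e-4
```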

Linear Warmup + Cosine Decay

$\eta$ increases linearly to its peak, then follows a cosine curve decaying to near zero. This is the strategy used by GPT-3, PaLM, and Llama 2 (2020-2023).

+ Effective when the total number of training steps is known in advance.
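
A sketch combining the two pieces, assuming hypothetical values for the peak LR, warmup steps, and the known total step count.

```python
import math

# Linear warmup + cosine decay to min_lr; all constants are hypothetical.
def warmup_cosine_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 1000,
                     total_steps: int = 100_000, min_lr: float = 0.0) -> float:
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps        # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))  # 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine

for s in (0, 1000, 50_000, 100_000):
    print(s, warmup_cosine_lr(s))
```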

Warmup-Stable-Decay (WSD)

$\eta$ increases linearly, stays at the peak for a long stable phase, then decays over a short final phase. The decay can be linear or square-root-shaped.
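
A sketch of WSD with a linear final decay (a square-root-shaped decay would replace the last phase); the phase lengths and peak LR are hypothetical.

```python
# Warmup-Stable-Decay (WSD): linear warmup, long plateau at peak_lr,
# short linear decay to zero. All step counts and peak_lr are hypothetical.
def wsd_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 1000,
           stable_steps: int = 90_000, decay_steps: int = 9_000) -> float:
    if step < warmup_steps:                              # warmup phase
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:               # stable phase
        return peak_lr
    decay_progress = (step - warmup_steps - stable_steps) / decay_steps
    return peak_lr * max(0.0, 1.0 - decay_progress)      # short decay phase

for s in (500, 50_000, 95_000, 100_000):
    print(s, wsd_lr(s))
```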

TODO: Schedule-Free Optimizers